library(data.table) # data.table type
library(tidyr) # gather()
library(kableExtra) # kables for html document
library(naniar) # vis_miss()
library(corrplot) # corrplot()
library(caret) # findCorrelation()
library(REdaS) # KMOS(), bart_spher()
library(psych) # KMO(), fa(), principal()
library(ggplot2) # ggplot()
options(scipen = 999)
# library(pastecs)
# library(dplyr)
# library(rcompanion)
# library(GPArotation)
# library(nFactors)
# library(knitr)
We start by loading both data files.
descriptions <- fread("Data File_Case Study_Factor Analysis_Variables.csv")
df <- read.csv("Data File_Case_Study_Factor Analysis_MD.csv")
We select only the relevant columns which correspond to question
numbers qd1 to qd33 inside
df.
df1 <- df[,c(9:41)]
We check for missing values.
vis_miss(df1)
Since questions 22, 26, 28 and 29 contain missing values, we delete the corresponding rows.
df1 <- na.omit(df1)
vis_miss(df1)
Now, our data frame df1 has no more missing values.
We proceed with this analysis assuming independence between the factors we will attempt to extract.
We compute the correlation matrix and plot it.
raqMatrix <- cor(df1)
corrplot(raqMatrix, tl.cex = 0.8)
We can see some strong correlations, indicated by bigger and darker
blue dots, between different questions. Furthermore, we notice that
qd4 and qd23 seem to be the only two items
with low correlations with all the other items. They are standalone
items.
A strong correlation between two variables is most likely a signal of redundancy. In the correlation plot above there are no very clear patterns, but we can observe that some variables are highly correlated with one another, which is good since we want to perform a factor analysis. More precisely, we are interested in the underlying factors causing those variables to move the way they do. A weaker correlation indicates that different factors affect the variables, meaning they do not describe the same domain.
Listed below are the variables displaying higher correlation, with a selected threshold of 0.65.
n = nrow(raqMatrix)
for (i in 1:(n-1)) {
for (j in (i+1):n) {
if (raqMatrix[i, j] >= 0.65) {
print(paste(colnames(df1)[i],
"and",
colnames(df1)[j],
"have a correlation of",
round(raqMatrix[i, j], 2)))
}
}
}
## [1] "qd1 and qd10 have a correlation of 0.84"
## [1] "qd1 and qd20 have a correlation of 0.84"
## [1] "qd1 and qd27 have a correlation of 0.84"
## [1] "qd2 and qd5 have a correlation of 0.76"
## [1] "qd2 and qd7 have a correlation of 0.82"
## [1] "qd2 and qd12 have a correlation of 0.81"
## [1] "qd2 and qd16 have a correlation of 0.69"
## [1] "qd3 and qd11 have a correlation of 0.76"
## [1] "qd3 and qd13 have a correlation of 0.76"
## [1] "qd3 and qd30 have a correlation of 0.74"
## [1] "qd5 and qd7 have a correlation of 0.76"
## [1] "qd5 and qd12 have a correlation of 0.73"
## [1] "qd5 and qd16 have a correlation of 0.71"
## [1] "qd6 and qd8 have a correlation of 0.74"
## [1] "qd6 and qd18 have a correlation of 0.73"
## [1] "qd6 and qd25 have a correlation of 0.81"
## [1] "qd7 and qd12 have a correlation of 0.81"
## [1] "qd7 and qd16 have a correlation of 0.73"
## [1] "qd8 and qd25 have a correlation of 0.72"
## [1] "qd9 and qd14 have a correlation of 0.83"
## [1] "qd9 and qd19 have a correlation of 0.88"
## [1] "qd9 and qd21 have a correlation of 0.79"
## [1] "qd9 and qd24 have a correlation of 0.85"
## [1] "qd10 and qd20 have a correlation of 0.83"
## [1] "qd10 and qd27 have a correlation of 0.83"
## [1] "qd11 and qd13 have a correlation of 0.74"
## [1] "qd11 and qd30 have a correlation of 0.7"
## [1] "qd12 and qd16 have a correlation of 0.69"
## [1] "qd13 and qd30 have a correlation of 0.83"
## [1] "qd14 and qd19 have a correlation of 0.84"
## [1] "qd14 and qd21 have a correlation of 0.82"
## [1] "qd14 and qd24 have a correlation of 0.87"
## [1] "qd15 and qd17 have a correlation of 0.86"
## [1] "qd15 and qd32 have a correlation of 0.83"
## [1] "qd17 and qd32 have a correlation of 0.84"
## [1] "qd18 and qd25 have a correlation of 0.7"
## [1] "qd19 and qd21 have a correlation of 0.79"
## [1] "qd19 and qd24 have a correlation of 0.85"
## [1] "qd20 and qd27 have a correlation of 0.84"
## [1] "qd21 and qd24 have a correlation of 0.84"
## [1] "qd22 and qd29 have a correlation of 0.84"
## [1] "qd22 and qd33 have a correlation of 0.81"
## [1] "qd26 and qd28 have a correlation of 0.66"
## [1] "qd26 and qd31 have a correlation of 0.67"
## [1] "qd28 and qd31 have a correlation of 0.68"
## [1] "qd29 and qd33 have a correlation of 0.79"
rm(i, j, n)
This allows us to separate the questionnaire items into clusters by
grouping together variables which are highly correlated. We can obtain
exactly the same groups by using the hclust option when
plotting the correlation matrix. Rows and columns are reordered so items
that are highly correlated sit side by side.
corrplot(raqMatrix, order = 'hclust', tl.cex = 0.8, addrect = 10)
The clustering of the correlation matrix into 10 clusters and our grouping with a threshold of 0.65 give the same results: the same 10 groups emerge.
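The idea behind the hclust ordering can be illustrated on a toy correlation matrix: feeding a correlation-based distance to hclust recovers correlated blocks of items. This is a sketch of the principle, not necessarily the exact distance corrplot uses internally.

```r
# toy 4-variable correlation matrix with two clear blocks
Rm <- diag(4)
rownames(Rm) <- colnames(Rm) <- paste0("q", 1:4)
Rm[1, 2] <- Rm[2, 1] <- 0.8   # block 1: q1, q2
Rm[3, 4] <- Rm[4, 3] <- 0.8   # block 2: q3, q4

# cluster on 1 - correlation, so highly correlated items are "close"
hc <- hclust(as.dist(1 - Rm))
groups <- cutree(hc, k = 2)   # q1, q2 end up together; q3, q4 together
```

On our data, the same recipe applied to raqMatrix with k = 10 should reproduce the groups visible in the rectangles of the corrplot.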
This grouping is quite close to the quality dimensions defined in the literature (see Table 1 of the assignment pdf). The differences are the presence of a construct about the quality of materials, which is not listed among the dimensions of quality, and the fact that questions 4 and 23 should reflect the same underlying construct, “Distinctiveness/Prestige”, but this does not seem to be the case in the data, as the two variables are very weakly correlated.
To make sure that our data is suitable for factor analysis, we want to test to what extent our variables are correlated with one another. To further the analysis, we measure the sampling adequacy based on the Kaiser-Meyer-Olkin criterion:
#Kaiser-Meyer-Olkin Statistics
KMO(df1)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = df1)
## Overall MSA = 0.96
## MSA for each item =
## qd1 qd2 qd3 qd4 qd5 qd6 qd7 qd8 qd9 qd10 qd11 qd12 qd13 qd14 qd15 qd16
## 0.96 0.97 0.96 0.64 0.98 0.96 0.97 0.98 0.96 0.97 0.96 0.97 0.96 0.96 0.95 0.98
## qd17 qd18 qd19 qd20 qd21 qd22 qd23 qd24 qd25 qd26 qd27 qd28 qd29 qd30 qd31 qd32
## 0.95 0.97 0.96 0.97 0.97 0.93 0.90 0.96 0.96 0.97 0.97 0.96 0.93 0.97 0.96 0.96
## qd33
## 0.95
The KMO test measures the proportion of variance between variables which can be attributed to common variance. It takes values between 0 and 1; higher values indicate that a variable is more suitable for factor analysis. The measurement is computed with the following formula: \[KMO_j=\frac{\sum_{i\neq j}r^2_{ij}}{\sum_{i \neq j} r^2_{ij}+\sum_{i\neq j}p^2_{ij}}\] where \(r_{ij}\) is the correlation coefficient of the variables indexed by \(i\) and \(j\), and \(p_{ij}\) is the partial correlation coefficient of those variables.
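As a sanity check, the overall KMO can be computed by hand from a correlation matrix, with the partial correlations obtained from the scaled inverse of the matrix. The helper below is an illustrative sketch, not necessarily how psych::KMO is implemented internally.

```r
# overall KMO from a correlation matrix R (sketch)
kmo_overall <- function(R) {
  inv <- solve(R)
  # partial correlation of i and j: -inv[i, j] / sqrt(inv[i, i] * inv[j, j])
  P <- -inv / sqrt(outer(diag(inv), diag(inv)))
  off <- upper.tri(R)
  sum(R[off]^2) / (sum(R[off]^2) + sum(P[off]^2))
}

# toy check: 3 variables, all pairwise correlations equal to 0.5
Rm <- matrix(0.5, 3, 3); diag(Rm) <- 1
kmo_overall(Rm)  # 9/13, about 0.69
```

Applied to raqMatrix, this should land near the overall MSA of 0.96 reported above.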
Our overall KMO value of 0.96 indicates that, on average, the
variables are well suited for factor analysis. The individual MSA values
of all variables except qd4 are above 0.9, which
corresponds to “marvelous” in Kaiser’s nomenclature.
The value of qd4 is considered “mediocre”. We must look
out for it in the next steps.
We will also take a look at the diagonal of the anti-image correlation matrix. This gives us the proportion of a variable’s variance that cannot be explained by any other variable.
# we use the KMO function from the psych package to retrieve the anti-image matrix
anti_mat_diag = data.frame("Question" = 1:33,
"UniqueVar" = diag(KMO(df1)$ImCov))
ggplot(data = anti_mat_diag, aes(x = Question, y = UniqueVar)) +
geom_col() +
theme_classic() +
scale_x_continuous(breaks = seq(1, 33, 1)) +
scale_y_continuous(limits = c(0,1)) +
labs(title = "Diagonal values of anti-image correlation matrix",
y = "Proportion of unique variance")
rm(anti_mat_diag)
In our case, qd4 and qd23 have values above
0.5, indicating that their topics have not been sufficiently covered in
the questionnaire. They clearly stand out as outliers compared to all the
other variables.
Another way to judge whether our data is suitable for factor analysis is Bartlett’s test. The null hypothesis of this test is that the sample originates from a population where all variables are uncorrelated, or in other words that the correlation matrix equals the identity matrix. The test is better suited to cases in which each variable has fewer than 5 observations, because it is very sensitive in large samples and thus might not be a good fit for us.
The test result is highly significant, meaning we can reject the null hypothesis: the data is fit for factor analysis.
bart_spher(df1)
## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = df1)
##
## X2 = 30857.911
## df = 528
## p-value < 2.22e-16
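For intuition, the statistic reported above can be reproduced from the textbook formula \(\chi^2 = -\left((n-1) - \frac{2p+5}{6}\right)\ln\det R\) with \(df = p(p-1)/2\). The helper below is a sketch; bart_spher may differ in implementation details.

```r
# Bartlett's test of sphericity computed by hand (sketch)
bartlett_chi2 <- function(data) {
  n <- nrow(data); p <- ncol(data)
  R <- cor(data)
  chi2 <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chi2 = chi2, df = df,
       p.value = pchisq(chi2, df, lower.tail = FALSE))
}

# toy check on simulated uncorrelated data
set.seed(42)
x <- matrix(rnorm(200), nrow = 50, ncol = 4)
res <- bartlett_chi2(x)
```

With our 33 items, df = 33 * 32 / 2 = 528, matching the output above.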
After having considered all the above elements, this data set looks
suitable for factor analysis. The only two items we need to pay special
attention to are qd4 and qd23, as they might
not be as well suited as the rest.
PAF is an exploratory factor analysis method (a factor extraction method): it attempts to find the smallest number of factors accounting for the common variance of a set of variables. It uses a reduced correlation matrix, whose diagonal contains communalities.
It assumes that the variance of the observed variables cannot be fully explained by the extracted factors: the variance is decomposed into communality and unique variance, without distinguishing specific from error variability.
PAF is implemented with the fa function from the
psych package.
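The mechanics of one extraction step can be sketched in a few lines: put initial communality estimates (squared multiple correlations) on the diagonal of the correlation matrix and eigen-decompose the reduced matrix; fa iterates this until the communalities stabilize. The helper below is illustrative, not the psych implementation.

```r
# one PAF extraction step on a correlation matrix (sketch)
paf_step <- function(R, nfactors) {
  smc <- 1 - 1 / diag(solve(R))   # initial communalities (SMCs)
  Rr <- R
  diag(Rr) <- smc                 # reduced correlation matrix
  e <- eigen(Rr)
  # loadings of the first `nfactors` factors
  e$vectors[, 1:nfactors, drop = FALSE] %*%
    diag(sqrt(pmax(e$values[1:nfactors], 0)), nfactors)
}

# toy check: 4 variables forming two correlated pairs
Rm <- diag(4)
Rm[1, 2] <- Rm[2, 1] <- 0.8
Rm[3, 4] <- Rm[4, 3] <- 0.8
L <- paf_step(Rm, 2)
```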
We use the “varimax” rotation, which is designed to simplify the interpretation of the loadings. Varimax works by maximizing the sum of the variances of the squared factor loadings. It pushes loadings toward 1 or 0, which facilitates interpretation.
The number of underlying factors has to be specified inside the
function. If we try to input a number of factors greater than 13,
fa returns an error. Thus, we test options for models going
from 1 factor to 13 and record each time some criteria aimed at helping
select the best model.
These criteria are:

- fit, which measures how well the factor model reproduces the correlation matrix;
- objective, which is the value of the function that is minimized by a maximum likelihood procedure;
- rms, which is the sum of the squared off-diagonal residuals divided by the degrees of freedom;
- crms, which is the RMS adjusted for degrees of freedom;
- TLI, which is the Tucker-Lewis Index of factoring reliability;
- BIC, which is the Bayesian information criterion.

A good model exhibits high values of fit and TLI, and low values of the others.
# we initialize an empty data frame which will record our criteria values
fit_df <- matrix(nrow = 13, ncol = 6)
colnames(fit_df) <- c("fit", "objective", "rms", "crms", "TLI", "BIC")
fit_df <- as.data.frame(fit_df)
# we compute the factor analysis for nfactors between 1 and 13 and record criteria results in fit_df
for (i in 1:13) {
FA <- fa(df1, nfactors = i, rotate = "varimax", fm = "pa")
fit_df[i, 1] <- FA$fit
fit_df[i, 2] <- FA$objective
fit_df[i, 3] <- FA$rms
fit_df[i, 4] <- FA$crms
fit_df[i, 5] <- FA$TLI
fit_df[i, 6] <- FA$BIC
}
rm(i, FA)
Now, we can plot the values of each criterion against the number of factors.
fit_df %>% gather() %>%
ggplot(aes(x = rep(1:13, ncol(fit_df)), y = value)) +
facet_wrap(~ key, scales = "free") +
geom_point() +
theme_classic() +
scale_x_continuous(breaks = seq(1, 13, 1)) +
geom_vline(xintercept = 8, linetype = "dashed") +
geom_vline(xintercept = 9, linetype = "dashed") +
labs(title = "Various criteria for Principal Axis Factoring",
x = "Number of factors",
y = "")
The dashed lines mark the numbers of factors which seem most appropriate for our data. The improvement in the criteria beyond 9 factors is much slower and less pronounced. Depending on the criterion, we would choose 8 or 9 factors, so we will test both models, with and without questions 4 and 23, since our exploratory analysis indicated that those two standalone variables will probably not be loaded onto any factor.
FA8 <- fa(df1, nfactors = 8, rotate = "varimax", fm = "pa")
FA8_423 <- fa(df1[, -c(4, 23)], nfactors = 8, rotate = "varimax", fm = "pa")
FA9 <- fa(df1, nfactors = 9, rotate = "varimax", fm = "pa")
FA9_423 <- fa(df1[, -c(4, 23)], nfactors = 9, rotate = "varimax", fm = "pa")
We examine the factor loadings of each model, using a plot similar to the correlation plot to quickly visualize them.
par(mfrow = c(2,2))
corrplot(t(FA8$loadings),
tl.cex = 0.7,
title = "PAF with 8 factors \n Loadings",
mar = c(0, 1, 3, 0))
corrplot(t(FA8_423$loadings),
tl.cex = 0.7,
title = "PAF with 8 factors excluding qd4 and qd23 \n Loadings",
mar = c(0, 1, 3, 0))
corrplot(t(FA9$loadings),
tl.cex = 0.7,
title = "PAF with 9 factors \n Loadings",
mar = c(0, 1, 3, 0))
corrplot(t(FA9_423$loadings),
tl.cex = 0.7,
title = "PAF with 9 factors excluding qd4 and qd23 \n Loadings",
mar = c(0, 1, 3, 0))
par(mfrow = c(1,1))
We remark the following:

- In FA8, questions 4 and 23 aren’t loaded on any factor. Every other variable is clearly loaded on a unique factor.
- In FA8_423, each variable is clearly loaded on one factor like before, since we deleted the two variables which weren’t associated with any of the 8 factors. If we decide to keep a model with 8 factors, we can delete qd4 and qd23 as they will not be loaded onto a factor in any case.
- In FA9, questions 4 and 23 are loaded onto one factor together and all other variables are loaded on one of the 8 other factors. The loadings of qd4 and qd23 are the lowest, however, at 0.414 and 0.474 respectively.
- In FA9_423, the ninth factor has no variable loaded on it, as we have deleted questions 4 and 23 and the other items are arranged on 8 factors. This model is not appropriate for the data and should be discarded.

In summary, there are two possible models:

- FA8_423, where we delete qd4 and qd23;
- FA9, where qd4 and qd23 are loaded onto a factor together.

We discard FA8 and FA9_423 and display the
numerical loadings greater than 0.3 of the two remaining models.
rm(FA8, FA9_423)
print(FA8_423$loadings, cutoff = 0.3, sort = TRUE)
##
## Loadings:
## PA2 PA1 PA5 PA4 PA3 PA7 PA6 PA8
## qd9 0.827
## qd14 0.851
## qd19 0.832
## qd21 0.785
## qd24 0.860
## qd2 0.723
## qd5 0.645
## qd7 0.738
## qd12 0.711
## qd16 0.616
## qd1 0.797
## qd10 0.749
## qd20 0.772
## qd27 0.747
## qd3 0.759
## qd11 0.749
## qd13 0.739
## qd30 0.677
## qd22 0.837
## qd29 0.813
## qd33 0.780
## qd6 0.741
## qd8 0.341 0.558
## qd18 0.601
## qd25 0.762
## qd15 0.749
## qd17 0.744
## qd32 0.715
## qd26 0.620
## qd28 0.672
## qd31 0.660
##
## PA2 PA1 PA5 PA4 PA3 PA7 PA6 PA8
## SS loadings 4.569 3.577 3.404 3.121 2.781 2.675 2.324 1.935
## Proportion Var 0.147 0.115 0.110 0.101 0.090 0.086 0.075 0.062
## Cumulative Var 0.147 0.263 0.373 0.473 0.563 0.649 0.724 0.787
print(FA9$loadings, cutoff = 0.3, sort = TRUE)
##
## Loadings:
## PA2 PA1 PA5 PA4 PA3 PA7 PA6 PA8 PA9
## qd9 0.830
## qd14 0.853
## qd19 0.834
## qd21 0.788
## qd24 0.862
## qd2 0.727
## qd5 0.649
## qd7 0.743
## qd12 0.716
## qd16 0.619
## qd1 0.796
## qd10 0.747
## qd20 0.772
## qd27 0.748
## qd3 0.761
## qd11 0.750
## qd13 0.736
## qd30 0.674
## qd22 0.835
## qd29 0.813
## qd33 0.779
## qd6 0.735
## qd8 0.352 0.559
## qd18 0.589
## qd25 0.746
## qd15 0.749
## qd17 0.743
## qd32 0.712
## qd26 0.616
## qd28 0.661
## qd31 0.653
## qd4 0.414
## qd23 0.474
##
## PA2 PA1 PA5 PA4 PA3 PA7 PA6 PA8 PA9
## SS loadings 4.632 3.667 3.412 3.134 2.783 2.559 2.320 1.882 0.517
## Proportion Var 0.140 0.111 0.103 0.095 0.084 0.078 0.070 0.057 0.016
## Cumulative Var 0.140 0.251 0.355 0.450 0.534 0.612 0.682 0.739 0.755
We notice that the factors created by PAF are the same groups we
identified with our clustering of the correlation matrix. In both
models, qd8, which was grouped with the features construct,
is partially loaded onto factor PA1, which corresponds to the
performance construct. We fetch its description.
descriptions[name %in% "qd8", ]
## name Label
## 1: qd8 the extra features offered in my smartphone usually function
Some respondents might consider not only the extra features but any
feature of their smartphone and evaluate its performance, i.e. how
well it functions. This might explain why qd8 is partially
loaded onto the performance construct. However, it remains mainly loaded
onto the features construct, with a value of 0.558 or 0.559 depending on
the model.
In model FA9, the loadings of qd4 and
qd23 on their factor are the weakest of all “main”
loadings. This could indicate that keeping qd4 and
qd23 might not be entirely appropriate and is up for
discussion.
When looking at the communalities of our models, we get the following results:
sort(FA8_423$communality)
## qd18 qd16 qd26 qd28 qd8 qd31 qd11 qd5
## 0.6266777 0.6402010 0.6551763 0.6849299 0.6855023 0.7001560 0.7169741 0.7224583
## qd33 qd3 qd21 qd30 qd12 qd25 qd2 qd13
## 0.7612886 0.7629531 0.7706646 0.7756551 0.7829168 0.8004648 0.8052923 0.8058405
## qd32 qd29 qd10 qd7 qd20 qd27 qd6 qd9
## 0.8295319 0.8312862 0.8333518 0.8357790 0.8360730 0.8390860 0.8392665 0.8406325
## qd19 qd14 qd1 qd15 qd22 qd17 qd24
## 0.8435205 0.8526145 0.8538126 0.8560076 0.8572538 0.8573949 0.8826324
sort(FA9$communality)
## qd4 qd23 qd16 qd26 qd18 qd28 qd8 qd31
## 0.1860315 0.2740969 0.6409344 0.6570266 0.6693727 0.6805048 0.6954544 0.7034429
## qd11 qd5 qd33 qd3 qd21 qd30 qd12 qd25
## 0.7194675 0.7286179 0.7617402 0.7666282 0.7707646 0.7740474 0.7848395 0.7916454
## qd13 qd2 qd32 qd29 qd10 qd20 qd7 qd27
## 0.8026750 0.8047862 0.8286370 0.8325405 0.8333801 0.8365920 0.8368335 0.8404022
## qd9 qd6 qd19 qd14 qd1 qd22 qd17 qd15
## 0.8405512 0.8421117 0.8439347 0.8525442 0.8531018 0.8557809 0.8564535 0.8588101
## qd24
## 0.8825049
This further indicates that questions 4 and 23 might not be
appropriate for factor analysis, as their communalities inside
FA9 are extremely low. The communality of an item is the
sum of the squared factor loadings for that item. It measures the
proportion of the item’s variance that can be explained by the
underlying factors. If the communality equals 1, then a complete
replication of the variable can be obtained through the factors
alone.
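As a minimal illustration of that definition (toy numbers, not from our models):

```r
# communality = row sum of squared loadings, on a made-up loading matrix
L <- matrix(c(0.8, 0.1,
              0.2, 0.7), nrow = 2, byrow = TRUE)
h2 <- rowSums(L^2)  # 0.8^2 + 0.1^2 = 0.65 and 0.2^2 + 0.7^2 = 0.53
```

On the fitted objects, rowSums(FA8_423$loadings^2) should match FA8_423$communality up to rounding, since the rotation is orthogonal.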
We observe the scree plots of our two models.
ggplot(mapping = aes(x = 1:length(FA8_423$values),
y = FA8_423$values,
color = (FA8_423$values >= 1))) +
geom_point() +
geom_hline(yintercept = 1, linetype = "dashed") +
theme_classic() +
labs(title = "Scree plot of FA8_423",
x = "Factor Number",
y = "Eigenvalue",
color = "Eigenvalue >= 1")
ggplot(mapping = aes(x = 1:length(FA9$values),
y = FA9$values,
color = (FA9$values >= 1))) +
geom_point() +
geom_hline(yintercept = 1, linetype = "dashed") +
theme_classic() +
labs(title = "Scree plot of FA9",
x = "Factor Number",
y = "Eigenvalue",
color = "Eigenvalue >= 1")
The eigenvalues are plotted above; the horizontal line represents the
significance threshold of 1. If we selected only based on
eigenvalues greater than 1, our model would have only 5 factors.
However, given all our previous analysis, this does not seem to be the
best choice. The elbow rule would probably cut at 8 factors for
FA8_423, as there is a larger gap between 8 and 9. However,
FA9 does not have such a gap and the descent is quite
smooth, so it is more difficult to decide.
After reviewing all the information we gathered about our model, we
decide to keep FA8_423 over FA9.
Questions 4 and 23 showed very little correlation with each other, and even though they should theoretically represent the same construct of distinctiveness and prestige, the construct was not explored enough through the survey questions. Every other construct had between 3 and 5 questions, and those questions exhibit high correlations within their group.
Furthermore, the BIC is lowest for 8 factors and starts slowly rising
again past 8. The loadings of qd4 and
qd23 in the FA9 model remain lower than all
other loadings of items inside their construct, being even lower than
0.5. When we looked at the diagonal of the anti-image correlation
matrix, the proportions of unique variance in questions 4 and 23
dominated the rest, indicating that the rest of the data set cannot
explain most of their variance. Moreover, their communalities were very
low in FA9.
These reasons combined made us choose model FA8_423. Its
constructs are the ones we listed when we inspected the correlation
matrix (except of course the constructs of questions 4 and 23). The
loading patterns of FA8_423, as we discussed in the
previous point, are clear.
Like with PAF, we run models with the number of components ranging from 1 to 33 and examine some criteria to get an idea of which best fit our data.
For PCA, the criteria are:

- fit, which measures how well the component model reproduces the correlation matrix;
- fit.off, which measures how well the off-diagonal elements of the correlation matrix are reproduced;
- objective, which is the value of the function that is minimized by a maximum likelihood procedure;
- rms, which is the sum of the squared off-diagonal residuals divided by the degrees of freedom.

# we initialize an empty data frame which will record our criteria values
fit_df <- matrix(nrow = 33, ncol = 4)
colnames(fit_df) <- c("fit", "fit.off", "objective", "rms")
fit_df <- as.data.frame(fit_df)
# we compute the PCA for nfactors between 1 and 33 and record criteria results in fit_df
for (i in 1:33) {
PCA <- principal(df1, nfactors = i, rotate = "varimax")
fit_df[i, 1] <- PCA$fit
fit_df[i, 2] <- PCA$fit.off
fit_df[i, 3] <- PCA$objective
fit_df[i, 4] <- PCA$rms
}
rm(i, PCA)
We plot the results like before.
fit_df %>% gather() %>%
ggplot(aes(x = rep(1:33, ncol(fit_df)), y = value)) +
facet_wrap(~ key, scales = "free") +
geom_point() +
theme_classic() +
scale_x_continuous(breaks = seq(1, 33, 2)) +
geom_vline(xintercept = 10, linetype = "dashed") +
labs(title = "Various criteria for Principal Component Analysis",
x = "Number of components",
y = "")
Here, the ideal number of components seems to be 10, indicated by the dashed line. It is a local minimum of the objective function and sits at the elbow of the other three curves.
However, we check whether this result varies when omitting the standalone
variables qd4 and qd23.
# we initialize an empty data frame which will record our criteria values
fit_df <- matrix(nrow = 31, ncol = 4)
colnames(fit_df) <- c("fit", "fit.off", "objective", "rms")
fit_df <- as.data.frame(fit_df)
# we compute the PCA for nfactors between 1 and 31 and record criteria results in fit_df
for (i in 1:31) {
PCA <- principal(df1[, -c(4,23)], nfactors = i, rotate = "varimax")
fit_df[i, 1] <- PCA$fit
fit_df[i, 2] <- PCA$fit.off
fit_df[i, 3] <- PCA$objective
fit_df[i, 4] <- PCA$rms
}
rm(i, PCA)
Again, we plot the evolution of the criteria.
fit_df %>% gather() %>%
ggplot(aes(x = rep(1:31, ncol(fit_df)), y = value)) +
facet_wrap(~ key, scales = "free") +
geom_point() +
theme_classic() +
scale_x_continuous(breaks = seq(1, 31, 2)) +
geom_vline(xintercept = 8, linetype = "dashed") +
labs(title = "Various criteria for Principal Component Analysis without qd4 and qd23",
x = "Number of components",
y = "")
rm(fit_df)
Without questions 4 and 23, the best model seems to be the one with 8 components.
Thus, we will test these three models.
PCA8 <- principal(df1, nfactors=8, rotate = "varimax")
PCA8_423 <- principal(df1[, -c(4, 23)], nfactors = 8, rotate = "varimax")
PCA10 <- principal(df1, nfactors = 10, rotate = "varimax")
We examine the factor loadings, first by plotting them.
par(mfrow = c(2,1))
corrplot(t(PCA8$loadings),
tl.cex = 0.7,
title = "PCA with 8 factors including qd4 and qd23 \n Loadings",
mar = c(1, 1, 3, 0))
corrplot(t(PCA8_423$loadings),
tl.cex = 0.7,
title = "PCA with 8 factors excluding qd4 and qd23 \n Loadings",
mar = c(1, 1, 3, 0))
par(mfrow = c(1,1))
corrplot(t(PCA10$loadings),
tl.cex = 0.7,
title = "PCA with 10 factors \n Loadings",
mar = c(1, 1, 3, 0))
In both
PCA8_423 and PCA10, each question
is strongly loaded onto one component. In PCA10,
questions 4 and 23 each have their own component and their loadings are
the strongest of all, in contrast to what we observed for PAF.
However, PCA8 has some less clear loadings, in
particular on factor RC7. The two groups we
had defined as durability and quality of materials are pushed together
onto one component, and the loadings of the durability items are smaller
than when they get their own component, as
in PCA8_423.
We take a look at the numerical values of the loadings, displaying only the ones above 0.3.
print(PCA8$loadings, cutoff = 0.3, sort=TRUE)
##
## Loadings:
## RC2 RC1 RC7 RC5 RC4 RC8 RC3 RC6
## qd9 0.847
## qd14 0.868
## qd19 0.851
## qd21 0.823
## qd24 0.868
## qd2 0.758
## qd5 0.700
## qd7 0.755
## qd12 0.742
## qd16 0.703
## qd15 0.768
## qd17 0.740
## qd26 0.597 0.322
## qd28 0.339 0.590
## qd31 0.692
## qd32 0.720 0.309
## qd1 0.816
## qd10 0.776
## qd20 0.801
## qd27 0.778
## qd3 0.806
## qd11 0.812
## qd13 0.760
## qd30 0.706
## qd6 0.752
## qd8 0.349 0.620
## qd18 0.700
## qd25 0.791
## qd22 0.849
## qd29 0.836
## qd33 0.838
## qd4 0.805
## qd23 0.677
##
## RC2 RC1 RC7 RC5 RC4 RC8 RC3 RC6
## SS loadings 4.682 3.973 3.931 3.544 3.306 2.952 2.862 1.240
## Proportion Var 0.142 0.120 0.119 0.107 0.100 0.089 0.087 0.038
## Cumulative Var 0.142 0.262 0.381 0.489 0.589 0.678 0.765 0.803
print(PCA8_423$loadings, cutoff = 0.3, sort = TRUE)
##
## Loadings:
## RC2 RC1 RC5 RC4 RC7 RC6 RC3 RC8
## qd9 0.847
## qd14 0.868
## qd19 0.851
## qd21 0.823
## qd24 0.869
## qd2 0.754
## qd5 0.704
## qd7 0.758
## qd12 0.750
## qd16 0.708
## qd1 0.822
## qd10 0.778
## qd20 0.803
## qd27 0.774
## qd3 0.807
## qd11 0.819
## qd13 0.763
## qd30 0.706
## qd6 0.762
## qd8 0.339 0.626
## qd18 0.715
## qd25 0.806
## qd22 0.854
## qd29 0.841
## qd33 0.842
## qd15 0.787
## qd17 0.780
## qd32 0.765
## qd26 0.727
## qd28 0.773
## qd31 0.738
##
## RC2 RC1 RC5 RC4 RC7 RC6 RC3 RC8
## SS loadings 4.673 3.908 3.492 3.297 3.020 2.856 2.492 2.337
## Proportion Var 0.151 0.126 0.113 0.106 0.097 0.092 0.080 0.075
## Cumulative Var 0.151 0.277 0.389 0.496 0.593 0.685 0.766 0.841
print(PCA10$loadings, cutoff = 0.3, sort = TRUE)
##
## Loadings:
## RC2 RC1 RC6 RC4 RC8 RC3 RC5 RC9 RC10 RC7
## qd9 0.849
## qd14 0.869
## qd19 0.852
## qd21 0.824
## qd24 0.870
## qd2 0.757
## qd5 0.706
## qd7 0.761
## qd12 0.753
## qd16 0.710
## qd1 0.822
## qd10 0.778
## qd20 0.803
## qd27 0.775
## qd3 0.805
## qd11 0.818
## qd13 0.764
## qd30 0.707
## qd6 0.759
## qd8 0.345 0.628
## qd18 0.698
## qd25 0.805
## qd22 0.854
## qd29 0.841
## qd33 0.842
## qd15 0.785
## qd17 0.779
## qd32 0.763
## qd26 0.725
## qd28 0.772
## qd31 0.735
## qd23 0.972
## qd4 0.989
##
## RC2 RC1 RC6 RC4 RC8 RC3 RC5 RC9 RC10 RC7
## SS loadings 4.714 3.960 3.502 3.300 2.958 2.851 2.476 2.313 1.017 1.011
## Proportion Var 0.143 0.120 0.106 0.100 0.090 0.086 0.075 0.070 0.031 0.031
## Cumulative Var 0.143 0.263 0.369 0.469 0.559 0.645 0.720 0.790 0.821 0.852
Like with PAF, we notice that qd8 is partially loaded
onto the component corresponding to performance. However, compared with
PAF, the loadings are overall slightly higher. The loadings of
questions 4 and 23 are extremely high in PCA10.
In PCA8, many variables have low loadings, which does
not seem to be the best option, given that the other models look much
cleaner in comparison.
We take a look at the communalities.
sort(PCA8$communality)
## qd23 qd28 qd26 qd31 qd4 qd16 qd8 qd18
## 0.5482221 0.6110891 0.6253977 0.6812121 0.6860378 0.7228627 0.7472794 0.7489624
## qd5 qd17 qd32 qd11 qd12 qd30 qd15 qd21
## 0.7758762 0.7983085 0.7991891 0.8061080 0.8124165 0.8130471 0.8143803 0.8251173
## qd3 qd2 qd13 qd25 qd7 qd33 qd6 qd10
## 0.8318214 0.8341482 0.8381954 0.8392376 0.8480460 0.8514121 0.8526109 0.8693608
## qd9 qd29 qd19 qd20 qd27 qd14 qd1 qd22
## 0.8712220 0.8731078 0.8735513 0.8750159 0.8756005 0.8803126 0.8806015 0.8835707
## qd24
## 0.8966449
sort(PCA8_423$communality)
## qd16 qd18 qd8 qd26 qd5 qd31 qd28 qd30
## 0.7281547 0.7417442 0.7445806 0.7696596 0.7781026 0.7916366 0.8042188 0.8142034
## qd11 qd12 qd21 qd3 qd2 qd13 qd25 qd7
## 0.8172090 0.8230745 0.8252227 0.8320578 0.8362610 0.8401490 0.8514445 0.8541838
## qd33 qd6 qd9 qd10 qd19 qd27 qd20 qd29
## 0.8570641 0.8581700 0.8717486 0.8732182 0.8739140 0.8771753 0.8792219 0.8799137
## qd14 qd32 qd1 qd22 qd24 qd17 qd15
## 0.8809326 0.8882966 0.8907906 0.8911649 0.8980463 0.9009975 0.9012585
sort(PCA10$communality)
## qd16 qd18 qd8 qd26 qd5 qd31 qd28 qd30
## 0.7283694 0.7516579 0.7530489 0.7708933 0.7800535 0.7916076 0.8055633 0.8154692
## qd11 qd12 qd21 qd3 qd2 qd13 qd7 qd25
## 0.8173072 0.8235556 0.8254960 0.8325685 0.8363938 0.8418185 0.8548721 0.8549029
## qd33 qd6 qd9 qd10 qd19 qd27 qd29 qd20
## 0.8571892 0.8608017 0.8717715 0.8735724 0.8741658 0.8785617 0.8801344 0.8802786
## qd14 qd32 qd1 qd22 qd24 qd17 qd15 qd23
## 0.8810422 0.8879448 0.8910183 0.8921397 0.8982019 0.9010692 0.9017948 0.9911283
## qd4
## 0.9965632
Unlike with PAF, in PCA10 the components dedicated to
qd4 and qd23 explain most of the variance of
those two variables. This gives both models high communalities for all
variables, with the lowest values being around 0.73.
The communalities for PCA8 are lower for many of the
variables; given this and the unclear loadings, we decide not to
move further with this model.
rm(PCA8)
We display the scree plots.
ggplot(mapping = aes(x = 1:length(PCA8_423$values),
y = PCA8_423$values,
color = (PCA8_423$values >= 1))) +
geom_point() +
geom_hline(yintercept = 1, linetype = "dashed") +
theme_classic() +
labs(title = "Scree plot of PCA8_423",
x = "Component Number",
y = "Eigenvalue",
color = "Eigenvalue >= 1")
ggplot(mapping = aes(x = 1:length(PCA10$values),
y = PCA10$values,
color = (PCA10$values >= 1))) +
geom_point() +
geom_hline(yintercept = 1, linetype = "dashed") +
theme_classic() +
labs(title = "Scree plot of PCA10",
x = "Component Number",
y = "Eigenvalue",
color = "Eigenvalue >= 1")
Again, the rule of keeping only eigenvalues above 1 is stricter than what our analysis indicates. However, both models have a small but distinct elbow, which corresponds to the number of components selected for each.
We take a look at the total variance explained by each component in our models.
Let us recall that the eigenvalue of a component represents the amount of the total variance explained (TVE) by that component. We can fetch these numbers directly from our models.
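On a correlation matrix, the total variance equals the number of variables, so each eigenvalue divided by p gives the “Proportion Var” row of the tables below. A toy example with made-up numbers:

```r
# for PCA on a correlation matrix, the proportion of variance per
# component is its eigenvalue divided by the number of variables
Rm <- matrix(c(1, 0.6,
               0.6, 1), nrow = 2)
ev <- eigen(Rm)$values   # 1.6 and 0.4
prop <- ev / ncol(Rm)    # 0.8 and 0.2; proportions sum to 1
```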
PCA8_423$Vaccounted
## RC2 RC1 RC5 RC4 RC7
## SS loadings 4.6725207 3.9075209 3.4924136 3.2967977 3.0197441
## Proportion Var 0.1507265 0.1260491 0.1126585 0.1063483 0.0974111
## Cumulative Var 0.1507265 0.2767755 0.3894340 0.4957824 0.5931935
## Proportion Explained 0.1792036 0.1498638 0.1339433 0.1264409 0.1158152
## Cumulative Proportion 0.1792036 0.3290674 0.4630107 0.5894516 0.7052668
## RC6 RC3 RC8
## SS loadings 2.85556858 2.49242392 2.33682587
## Proportion Var 0.09211512 0.08040077 0.07538148
## Cumulative Var 0.68530857 0.76570934 0.84109082
## Proportion Explained 0.10951863 0.09559107 0.08962347
## Cumulative Proportion 0.81478546 0.91037653 1.00000000
Thus, with 8 components, 84.11% of the total variance is explained by the model.
PCA10$Vaccounted
## RC2 RC1 RC6 RC4 RC8
## SS loadings 4.7136511 3.9601108 3.5020257 3.29975846 2.95816184
## Proportion Var 0.1428379 0.1200034 0.1061220 0.09999268 0.08964127
## Cumulative Var 0.1428379 0.2628413 0.3689633 0.46895594 0.55859721
## Proportion Explained 0.1677399 0.1409244 0.1246230 0.11742513 0.10526908
## Cumulative Proportion 0.1677399 0.3086643 0.4332873 0.55071245 0.65598154
## RC3 RC5 RC9 RC10 RC7
## SS loadings 2.8508701 2.47567244 2.31255667 1.01696953 1.01117878
## Proportion Var 0.0863900 0.07502038 0.07007747 0.03081726 0.03064178
## Cumulative Var 0.6449872 0.72000759 0.79008506 0.82090232 0.85154410
## Proportion Explained 0.1014510 0.08809923 0.08229459 0.03618986 0.03598379
## Cumulative Proportion 0.7574325 0.84553176 0.92782636 0.96401621 1.00000000
Thus, with 10 components the model explains 85.15% of the total variance.
Both results are quite close, and in each model the retained components explain most of the variance in the data.
To be coherent with the best model we chose for PAF, we select
PCA8_423. PCA10 works well from the point of
view of principal component analysis, whose goal is to produce an
empirical summary of the data set. However, the components of a PCA do not
necessarily have a causal interpretation, in contrast to PAF, which
aims to find the underlying constructs of the data, called factors. Since our
PAF analysis showed that qd4 and qd23 do not
lend themselves to this factor modelling because of their standalone
nature, we continue with FA8_423 and
PCA8_423 to preserve the most meaningful interpretation of our data set.
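The conceptual difference between the two techniques can be illustrated on a toy correlation matrix (all numbers hypothetical, not our data): principal() reproduces total variance, so its communalities absorb unique variance as well, whereas fa() with principal axis factoring models shared variance only.

```r
library(psych)

# Toy 6-variable correlation matrix with one strong cluster (hypothetical values)
R <- matrix(0.3, 6, 6)
R[1:3, 1:3] <- 0.7
diag(R) <- 1

pc <- principal(R, nfactors = 2, rotate = "varimax")     # PCA
pa <- fa(R, nfactors = 2, rotate = "varimax", fm = "pa") # PAF

# PCA communalities are typically larger, since components also absorb unique variance
round(cbind(PCA = pc$communality, PAF = pa$communality), 2)
```

This is why the PCA table above reports a higher cumulative TVE than the corresponding PAF table for the same number of dimensions.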
rm(FA9, PCA10)
As we saw previously, both FA8_423 and
PCA8_423 have shown clear loading patterns. We now look at
the TVE tables again to gain more insight into how the eigenvalues (the
sums of squared loadings by factor or component) describe the
portion of variance attributed to each factor or component.
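To illustrate this identity, here is a minimal sketch on a hypothetical 4-variable, 2-factor loading matrix: summing the squared loadings down each column yields the SS loadings row, and dividing by the number of variables yields the Proportion Var row of a Vaccounted table.

```r
# Hypothetical rotated loading matrix: 4 variables, 2 factors
L <- matrix(c(0.8, 0.7, 0.1, 0.2,
              0.1, 0.2, 0.9, 0.6), ncol = 2)

ss   <- colSums(L^2)   # SS loadings: the eigenvalue of each factor
prop <- ss / nrow(L)   # Proportion Var: each variable contributes 1 unit of variance
cumsum(prop)           # Cumulative Var row of the Vaccounted table
```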
TVE_PAF <- FA8_423$Vaccounted
TVE_PCA <- PCA8_423$Vaccounted
# we rename each factor or component to designate the quality dimensions we have established
colnames(TVE_PAF) <- c("Serviceability", "Performance", "Aesthetics", "Ease of use",
"Flawlessness", "Features", "Quality of materials", "Durability")
colnames(TVE_PCA) <- c("Serviceability", "Performance", "Aesthetics", "Ease of use",
"Features", "Flawlessness", "Quality of materials", "Durability")
TVE_PAF %>% kable(digits = 3) %>% kable_styling(full_width = TRUE)
|  | Serviceability | Performance | Aesthetics | Ease of use | Flawlessness | Features | Quality of materials | Durability |
|---|---|---|---|---|---|---|---|---|
| SS loadings | 4.569 | 3.577 | 3.404 | 3.121 | 2.781 | 2.675 | 2.324 | 1.935 |
| Proportion Var | 0.147 | 0.115 | 0.110 | 0.101 | 0.090 | 0.086 | 0.075 | 0.062 |
| Cumulative Var | 0.147 | 0.263 | 0.373 | 0.473 | 0.563 | 0.649 | 0.724 | 0.787 |
| Proportion Explained | 0.187 | 0.147 | 0.140 | 0.128 | 0.114 | 0.110 | 0.095 | 0.079 |
| Cumulative Proportion | 0.187 | 0.334 | 0.474 | 0.602 | 0.716 | 0.825 | 0.921 | 1.000 |
TVE_PCA %>% kable(digits = 3) %>% kable_styling(full_width = TRUE)
|  | Serviceability | Performance | Aesthetics | Ease of use | Features | Flawlessness | Quality of materials | Durability |
|---|---|---|---|---|---|---|---|---|
| SS loadings | 4.673 | 3.908 | 3.492 | 3.297 | 3.020 | 2.856 | 2.492 | 2.337 |
| Proportion Var | 0.151 | 0.126 | 0.113 | 0.106 | 0.097 | 0.092 | 0.080 | 0.075 |
| Cumulative Var | 0.151 | 0.277 | 0.389 | 0.496 | 0.593 | 0.685 | 0.766 | 0.841 |
| Proportion Explained | 0.179 | 0.150 | 0.134 | 0.126 | 0.116 | 0.110 | 0.096 | 0.090 |
| Cumulative Proportion | 0.179 | 0.329 | 0.463 | 0.589 | 0.705 | 0.815 | 0.910 | 1.000 |
In both PAF and PCA, the three factors or components explaining the largest proportions of variance are Serviceability, Performance and Aesthetics. The only difference in ordering is that Features and Flawlessness switch places between the two models. To better visualize how the explained variance is distributed among the factors, we display two Pareto charts.
ggplot(mapping = aes(x = reorder(colnames(TVE_PAF), -TVE_PAF[4,]), y = TVE_PAF[4,])) +
geom_col() +
geom_line(aes(x = 1:8, y = TVE_PAF[5,]), color = "red") +
geom_point(aes(x = 1:8, y = TVE_PAF[5,])) +
theme_classic() +
labs(x = "Factor",
y = "Proportion of TVE",
title = "Pareto chart of TVE for FA8_423")
ggplot(mapping = aes(x = reorder(colnames(TVE_PCA), -TVE_PCA[4,]), y = TVE_PCA[4,])) +
geom_col() +
geom_line(aes(x = 1:8, y = TVE_PCA[5,]), color = "red") +
geom_point(aes(x = 1:8, y = TVE_PCA[5,])) +
theme_classic() +
labs(x = "Component",
y = "Proportion of TVE",
title = "Pareto chart of TVE for PCA8_423")
rm(TVE_PAF, TVE_PCA)
In both cases, Serviceability is the most important dimension, with the biggest gap with respect to the other dimensions. The Pareto curve is very smooth and flat, which means that the importance of the dimensions decreases slowly. This can also be observed in the minimal gaps between the bars once we pass Serviceability. Thus, even if some dimensions are more important than others, every dimension contributes its share to the TVE; there does not seem to be a superfluous or unimportant dimension.
We run the OFA with principal axis factoring as before. What changes is the rotation method, which is no longer orthogonal but accounts for correlations between factors (hence the term oblique).
OFA8_423 <- fa(df1[, -c(4, 23)], nfactors = 8, rotate = "promax", fm = "pa")
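With an oblique rotation, the model additionally estimates the inter-factor correlation matrix, available in the Phi slot of the fa() object. A self-contained sketch on a model-implied correlation matrix (all numbers hypothetical, not our data):

```r
library(psych)

# Two-factor structure whose factors correlate at 0.4 (hypothetical)
fx  <- matrix(c(rep(0.7, 4), rep(0, 8), rep(0.7, 4)), ncol = 2)
Phi <- matrix(c(1, 0.4, 0.4, 1), 2)
R   <- fx %*% Phi %*% t(fx)
diag(R) <- 1

# Principal axis factoring with an oblique (promax) rotation, as in our model
obl <- fa(R, nfactors = 2, rotate = "promax", fm = "pa", n.obs = 500)
round(obl$Phi, 2)  # recovered inter-factor correlation matrix
```

In our case, nontrivial off-diagonal entries of OFA8_423$Phi would indicate that the quality dimensions themselves correlate, something the earlier orthogonal rotations ruled out by construction.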
We can fetch the factor scores directly from our model. We rename the columns as we did with the correlation matrix before, and display the first six rows of our factor scores.
factors <- as.data.frame(OFA8_423$scores)
colnames(factors) <- c("Serviceability", "Performance", "Aesthetics", "Ease of use",
"Features", "Quality of materials", "Flawlessness", "Durability")
head(factors) %>% kable(digits = 2) %>% kable_styling(full_width = TRUE)
| Serviceability | Performance | Aesthetics | Ease of use | Features | Quality of materials | Flawlessness | Durability |
|---|---|---|---|---|---|---|---|
| -1.09 | -0.59 | -1.42 | -0.71 | -0.94 | -1.11 | -1.04 | -0.27 |
| 0.14 | -0.20 | -0.10 | -0.62 | 0.26 | 0.05 | -1.04 | -0.84 |
| -1.09 | -0.54 | -1.41 | -0.83 | -0.82 | -1.11 | -0.20 | 0.17 |
| 0.34 | 1.08 | 1.22 | 0.98 | 1.34 | 0.96 | 1.17 | 0.89 |
| -1.06 | -0.49 | -0.16 | 0.50 | -0.92 | 0.09 | 0.01 | 0.29 |
| -1.10 | -0.49 | -1.46 | -1.37 | -0.92 | -0.75 | -0.04 | -0.52 |
dfreg$brandrec <- as.factor(dfobl$brandrec)
levels(dfreg$brandrec) <- c("Apple", "Samsung", "LG", "Motorola", "Other")
ggplot(data = dfreg, aes(brandrec)) +
geom_bar() +
theme_classic() +
labs(title = "Repartition of brands in dataset",
x = "Brand",
y = "Count")
summary(dfreg$brandrec)
## Apple Samsung LG Motorola Other
## 534 234 63 68 70
lmreg <- lm(wtp ~. , data = dfreg[dfreg$brandrec == "Apple", -c(10,13)])
summary(lmreg)
##
## Call:
## lm(formula = wtp ~ ., data = dfreg[dfreg$brandrec == "Apple",
## -c(10, 13)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.09117 -0.36419 0.06468 0.51468 1.55683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.23695 0.03163 7.492 0.000000000000291 ***
## Serviceability 0.12390 0.04381 2.828 0.00486 **
## Performance 0.03104 0.06392 0.486 0.62745
## Aesthetics 0.14958 0.05106 2.929 0.00354 **
## `Ease of use` 0.20919 0.04912 4.259 0.000024394234904 ***
## Features 0.05796 0.05348 1.084 0.27896
## `Quality of materials` 0.08436 0.05228 1.614 0.10720
## Flawlessness -0.04971 0.04431 -1.122 0.26248
## Durability 0.03166 0.05368 0.590 0.55564
## qd4 0.01536 0.03193 0.481 0.63073
## qd23 0.07032 0.03401 2.067 0.03920 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7065 on 523 degrees of freedom
## Multiple R-squared: 0.3592, Adjusted R-squared: 0.3469
## F-statistic: 29.31 on 10 and 523 DF, p-value: < 0.00000000000000022
lmreg <- lm(ri ~. , data = dfreg[dfreg$brandrec == "Motorola", -c(9,13)])
summary(lmreg)
##
## Call:
## lm(formula = ri ~ ., data = dfreg[dfreg$brandrec == "Motorola",
## -c(9, 13)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5591 -0.3686 0.0016 0.3651 2.2353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.41351 0.09675 -4.274 0.0000738 ***
## Serviceability -0.02048 0.12346 -0.166 0.868837
## Performance 0.29686 0.15383 1.930 0.058618 .
## Aesthetics 0.13237 0.12427 1.065 0.291280
## `Ease of use` 0.45254 0.11961 3.783 0.000373 ***
## Features -0.16613 0.16445 -1.010 0.316664
## `Quality of materials` -0.13988 0.13441 -1.041 0.302392
## Flawlessness 0.06691 0.14153 0.473 0.638161
## Durability 0.11952 0.14783 0.808 0.422179
## qd4 0.11227 0.10625 1.057 0.295134
## qd23 0.01087 0.09431 0.115 0.908618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6805 on 57 degrees of freedom
## Multiple R-squared: 0.6125, Adjusted R-squared: 0.5445
## F-statistic: 9.01 on 10 and 57 DF, p-value: 0.00000001086
ggplot(data = melt(dfreg[, c(1:8, 13)]), aes(y = value, color = brandrec)) +
facet_wrap(~ variable, nrow = 2) +
geom_boxplot() +
theme_classic() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(title = "Boxplots of factor scores per brand",
y = "Factor score",
color = "Brand")